Guidelines for Preserving New Forms of Scholarship

Guidelines

Embedded enhanced features, especially those that link to resources outside of the publication or use an unusual format, are at the highest risk of failing in the future. For this reason, a meaningful caption is vital for providing clues to future readers about what they should expect to find in that location in the text and, preferably, some means of finding and accessing it. Ideally, this caption would include a title, source, unique persistent identifier (e.g. DOI, ARK ID, or Handle), and a link to an archived copy if different from the identifier. Though any link could ultimately fail, this information would at least provide clues to where the user might find an archived copy. When creating captions, apply the standards available within the format you are using to support automated parsing. For example, HTML5 has the <figure> and <figcaption> elements. The “alt” attribute is also widely used to supply alternative text when a feature cannot be viewed. In this respect, a meaningful caption may also help meet standards for digital accessibility.
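
As a rough sketch (the file names, identifier, and archive URL below are hypothetical placeholders), a caption marked up with HTML5 might look like this:

  <figure>
    <img src="images/excavation-site.jpg"
         alt="Aerial photograph of the excavation site, 2019">
    <figcaption>
      Figure 2. Aerial photograph of the excavation site, 2019.
      Source: Example University Archives.
      Identifier: https://doi.org/10.12345/example
      Archived copy: https://archive.example.org/record/1234
    </figcaption>
  </figure>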

Where non-text features are supplied as separate publication resources, this guideline may also be relevant:
24. Create metadata for each publication resource

Some platforms support assigning each publication resource its own descriptive metadata and landing page, making it possible to cite them independently of the text as a whole. In these cases, if the publisher has the capacity to assign unique persistent identifiers such as valid DOIs, ARK IDs, or Handles to each publication resource and to provide these as part of the metadata, this can help maintain connections between the components of a publication and sustain citation links. As an example, consider the case where a video is embedded in an EPUB and has a caption under it that includes a registered DOI. The DOI points to a page dedicated to the published video. If the publisher no longer has that material, a preservation service may have the option to register the location of its preservation copy with doi.org so that the link points to a new location. If a resource is local to the publication and is not intended to be cited or described independently, then a meaningful caption provides useful context, but creating persistent identifiers isn’t necessary.
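
As an illustration of what resource-level descriptive metadata could include (a hedged sketch using Dublin Core elements; the identifiers and values are hypothetical), a record for the embedded video might carry its own DOI and a relation back to the parent publication:

  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Interview with the author, 2021</dc:title>
    <dc:type>MovingImage</dc:type>
    <dc:identifier>https://doi.org/10.12345/video-example</dc:identifier>
    <dc:relation>https://doi.org/10.12345/book-example</dc:relation>
    <dc:rights>https://creativecommons.org/licenses/by/4.0/</dc:rights>
  </metadata>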

These guidelines also relate to the use of identifiers:
17. Use persistent identifiers to link or cite external resources
24. Create descriptive metadata for each publication resource, include identifiers
31. Assign persistent identifiers to significant versions

Correct handling of character encoding can make an enormous difference to whether a publication is properly rendered. The encoding should be expressed in the metadata and/or within the publication, as appropriate for the format. For example, websites may declare the encoding in a <meta> tag and/or in the charset parameter of the Content-Type HTTP header.
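
For example, a UTF-8 encoded HTML page would typically declare its encoding both in the markup and in the HTTP response:

  <!-- In the HTML document -->
  <meta charset="utf-8">

  <!-- In the HTTP response header -->
  Content-Type: text/html; charset=utf-8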

A preservation service may not collect web content outside of the agreed-upon domain names unless copyright for the content being harvested is clear. If third-party pages and features that are visually embedded in an EPUB or a web-based publication are meant to be preserved, it should be possible to identify which content publishers have the right to collect so that a web crawler can be configured to include or exclude it. One way to differentiate is to consistently express the rights in the metadata that is supplied to the preservation service. Another option is to apply structured metadata describing the rights status within the HTML. The Creative Commons REL (ccREL) documentation includes examples that cover both page- and object-level licenses, an approach that could support automated harvesting decisions at either level. Alternatively, a publisher could supply a list of domain names to include for harvest during the initial preservation workflow configuration.
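
As a rough sketch based on the ccREL pattern (the file path and license choices are illustrative), a page-level license and a different license for one embedded object might be expressed like this:

  <!-- Page-level license -->
  <a rel="license" href="https://creativecommons.org/licenses/by/4.0/">
    This publication is licensed under CC BY 4.0
  </a>

  <!-- Object-level license for one embedded image, using RDFa -->
  <div about="/images/third-party-chart.png">
    Chart reproduced under a
    <a rel="license" href="https://creativecommons.org/licenses/by-nc/4.0/">CC BY-NC 4.0</a>
    license.
  </div>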

These guidelines may also be useful to consider when embedding external web content:
25. Add license information to resource-level metadata
38. List the URLs for external web content in the metadata
45. Embed metadata that includes a license in the <head> of a web page

HTML iframes can contain a wide range of content from a wide range of sources, which makes them a challenge for preservation. The quality of automated website archiving in general can vary greatly. When iframes are embedded in an EPUB or website, the more inconsistent, complex, and dynamic their content, the more likely it is to be lost in an automated process. If these features are important to preserve, consider a manual process to capture and package the intellectual components of the iframe content in another form. For example, a video or screenshot with a caption that links to the website might be a sufficient fallback for conveying the contents of the iframe.

These guidelines may also be relevant to use of iframes:
38. List the URLs for each embedded iframe in the metadata
39. Avoid use of iframes in EPUBs
42. Facilitate a local web archiving workflow to support iframes

Sitemaps containing links to all of the content in a website ensure that website archiving crawlers will be able to locate all of the content. Doing so may also improve search engine optimization. Sitemaps that are intended to facilitate web archiving should include links for all texts, resource landing pages, downloads, and views of the data, i.e. API URLs that are called dynamically while the user interacts with the page, covering each combination of query parameters that may appear.
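
A minimal sitemap along these lines (the URLs are hypothetical) would let a crawler reach the text, a resource landing page, a download, and a dynamically called API view:

  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.org/book-slug/text/chapter-1</loc></url>
    <url><loc>https://example.org/book-slug/resources/video-3</loc></url>
    <url><loc>https://example.org/book-slug/downloads/dataset.csv</loc></url>
    <url><loc>https://example.org/book-slug/api/records?type=map</loc></url>
  </urlset>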

This guideline will make creating a sitemap simpler:
46. For websites, give each page state its own URL

Successful website archiving is contingent on a harvester visiting each URL that forms the work. If a full list of URLs is not supplied to the harvesting tool via a sitemap or through other configuration, automation may be used to discover the URLs. Automated website crawling tools can easily identify the target of simple HTML <a> or <link> tags with a relative or full URL, and will include them in a crawl. Many websites, however, use JavaScript actions to fetch content. Crawlers may not be able to identify the URLs that are loaded by JavaScript, causing the content to be missed during an automated archiving process. Similarly, hyperlinks that are within compiled features, e.g. compiled 3D visualizations, can be difficult or impossible for a crawler to discover. When designing web content, consider the value of using simple HTML links so that crawlers can identify the URLs that make up a work. Note that, as with <link> tags, the target URLs of <a> tags will likely be crawled even if they do not display text on the page, so they can be used to guide a crawler to relevant content. Conversely, a crawler cannot determine which of these tags link to content that is not vital to the work, so using these tags for other purposes, or leaving hidden link tags that are never used, can guide the crawler to material that may be out of scope for an archived copy of the publication, such as previous or unused iterations of a page.
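
As a minimal sketch of the difference (loadChapter is a hypothetical script function), the first element below exposes its target URL to a crawler, while the second hides it inside JavaScript:

  <!-- Discoverable: the target URL is in the markup -->
  <a href="/book-slug/text/chapter-2">Chapter 2</a>

  <!-- Not discoverable: the URL is only constructed when the script runs -->
  <button onclick="loadChapter(2)">Chapter 2</button>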

This guideline may make changes for efficient crawling less critical:
43. Include a sitemap for all web-based publications

Embedding metadata in the <head> of a web page can help facilitate a fully automated web harvest of content in situations where an export is not a feasible approach. Bibliographic metadata is a vital component of a publication preservation package. As with other metadata, it’s best to use a broadly adopted standard or convention such as Google Scholar meta tags, Dublin Core, or PRISM. Cover the core bibliographic information needed to make the publication findable, and be consistent. An expression of the material’s license, for example through <link rel="license" href=...>, is valuable since this can support an archive’s understanding of whether the material can be preserved and how it can be reused. Note that HTTP Link headers can also be used to convey some metadata and can be applied to the HTTP response of both HTML and non-HTML web resources. An approach to this is described on signposting.org.
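
A hedged sketch of bibliographic metadata embedded in the <head>, here using Dublin Core style meta tags (the title, name, and identifier are placeholders, and the exact tag names depend on the standard chosen):

  <head>
    <meta charset="utf-8">
    <meta name="DC.title" content="Example Monograph Title">
    <meta name="DC.creator" content="Author, Example">
    <meta name="DC.date" content="2023">
    <meta name="DC.identifier" content="https://doi.org/10.12345/example">
    <link rel="license" href="https://creativecommons.org/licenses/by/4.0/">
  </head>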

These guidelines may also be relevant when generating bibliographic metadata:
21. Provide bibliographic metadata with exported publications
30. Bibliographic metadata in the context of EPUBs
40. The license for external resources can be expressed in HTML

Data-driven websites can technically display different sets of resources from the server at the same URL. If different views of a page share the same URL, however, retrieving that page from a web archive could have unpredictable results. It is therefore helpful to ensure that, where reasonable, the URL reflects any filters or properties that change what is loaded into the browser from the server, via the path or the query string (the part of the URL following the question mark). This not only allows the different states of a page to be bookmarked, but also makes it possible for a sitemap to express the full range of resources that make up the website. While a sitemap can include API calls that might be used for dynamically generated views, sitemaps are easier to maintain if these views are also reflected in the browser’s address bar.
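
For example (the paths and parameters are hypothetical), filter controls could be plain links whose query strings capture the state, so each view has a URL that can be bookmarked and listed in a sitemap:

  <a href="/book-slug/map?region=north&amp;year=1850">Northern region, 1850</a>
  <a href="/book-slug/map?region=south&amp;year=1850">Southern region, 1850</a>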

This is another guideline about making URLs web archive friendly:
49. Parameters should not be added to the URL unnecessarily

Key URLs for a publication, such as a publication’s home page, should not change over time. If they must change, redirect the original URL to the new location. Apart from helping to decrease broken links from other websites, using a well-planned URL structure can help with website preservation. Ensuring the publication’s URL does not change over time can make it easier to manage and connect different versions of the publication that are preserved, and to avoid duplication.

These guidelines discuss identifiers, another way to support URL persistence:
27. Persistent identifiers can be used at the publication resource level
31. Persistent identifiers should be assigned to new versions of the work

Where there are multiple publications on the same domain or subdomain, and each one spans multiple pages, using a consistent and hierarchical naming convention in the URL path helps web harvesting tools identify the scope of each publication. For example, if the publication content is organized in directories such as example.org/book-slug/text and example.org/book-slug/resources, a crawler can be set to generate an archive of the resources within the “book-slug” directory.

Website crawling and playback of web archives use URLs as unique references, including the query parameters (after the “?” and, for some tools, after the “#”). Adding parameters to the URL that do not affect what data is loaded from the server, or that simply reflect a default where the page is the same with or without the parameter, complicates the capture and playback of the web archive and bloats the size of the crawl, since every URL is captured as if it were a new page even when the content is identical.

This guideline is also useful for creating web archive friendly URLs:
46. Assign each unique page state one, and only one, URL

Many modern websites depend on JavaScript to load data from the server as the user interacts with the site, creating a dynamic experience. This can make it difficult for a web crawler to automatically create a functional copy of a web page, since the crawler may not be able to predict all user behaviors that pull new content from the server. Some web developers design websites using a “progressive enhancement” approach, in which a baseline of functionality is supported across a variety of environments, including those with scripts disabled. Where this approach is used, the version of the site presented to users changes if they choose to disable, or cannot support, JavaScript in their environment. They instead see a scriptless version of the site that presents the core intellectual components of the page in a more static form. If this functionality exists or can be easily supported, it can serve as an alternative way to capture pages using web archiving in cases where the full dynamic version cannot be crawled automatically.
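
As a hedged sketch of this approach (the data and element names are invented), the markup below always contains a static table of the underlying data; a script, if it runs, might replace the table with an interactive timeline, and without JavaScript the table remains:

  <div id="timeline">
    <table>
      <caption>Key events, 1900-1950</caption>
      <tr><th>Year</th><th>Event</th></tr>
      <tr><td>1914</td><td>Example event one</td></tr>
      <tr><td>1945</td><td>Example event two</td></tr>
    </table>
  </div>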

This guideline describes an alternative way to manage JavaScript-rich features:
53. For dynamic web page features, favor designs that pre-load data

Linking to media hosted on platforms such as YouTube or Vimeo is a risk to long-term content availability, especially for media that is owned or managed by third parties. To mitigate future link rot and the general instability of archiving streamed content, where appropriate (technically and legally), host a local copy of any media assets and embed it in the web page using standard HTML5 media tags. To keep the overall size of embedded media manageable for access and for the purpose of web archiving, it may be advantageous to embed lower-quality copies of the media and link to higher-resolution versions via persistent links such as DOIs.
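
A minimal sketch (the file name and DOI are placeholders) embedding a locally hosted, lower-resolution copy and pointing to the higher-resolution version through a persistent link:

  <figure>
    <video controls width="640">
      <source src="media/field-recording-720p.mp4" type="video/mp4">
      Your browser does not support embedded video.
    </video>
    <figcaption>
      Video 1. Field recording, 2020 (720p local copy).
      High-resolution version: https://doi.org/10.12345/recording-example
    </figcaption>
  </figure>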

See also:
12. Start discussions about multimedia features early
14. Avoid depending on externally hosted web services

Platforms with good search engine optimization provide link paths for navigating to every page. This is also useful for web archiving, since search-engine crawlers and web-archiving crawlers use similar mechanisms to discover all pages of content.

These guidelines also help a website crawler discover all content:
43. A sitemap can help website crawlers reach unlinked content
44. Use simple links to help a website crawler find content
46. Ensure each page state has its own unique URL

To improve the likelihood that content published to the web can be captured via web archiving methods, developers could pre-load any content that would otherwise depend on user interactions. For example, rather than repeatedly making small API calls as the user interacts with a feature, if the dataset that supports the feature is small enough, load the data as a single JSON file when the page loads so that further server calls are not necessary.
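
A hedged sketch of this pattern, with hypothetical data: the dataset could be embedded directly in the page (or fetched once as a single JSON file on load), so the feature never needs to call the server again as the user interacts with it:

  <script type="application/json" id="timeline-data">
    [
      {"year": 1914, "event": "Example event one"},
      {"year": 1945, "event": "Example event two"}
    ]
  </script>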

This guideline describes another approach:
50. Consider a “progressive enhancement” design to support a scriptless environment

Avoid using the “embed” option to insert a social media post into your publication. This can be unstable for preservation and for long-term sustainability, since posts or accounts may be deleted. If the social media post is integral to the work, consider first taking a screenshot that can be embedded into the publication as an image, with a caption underneath indicating the origin of the post. Finally, use a web archive service such as archive.today or the Internet Archive’s Save Page Now to create a copy of the post; be sure to test the results, since archiving social media posts can be unreliable. The two links (live and archived) could be referenced as a citation or footnote, depending on local practices.
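
A hedged sketch of the resulting markup (the account, URLs, and file name are invented placeholders):

  <figure>
    <img src="images/post-2023-05-01.png"
         alt="Screenshot of a public post by @example, 1 May 2023">
    <figcaption>
      Post by @example, 1 May 2023.
      Original: https://social.example/@example/1234567890
      Archived copy: https://archive.example.org/saved/1234567890
    </figcaption>
  </figure>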

These guidelines are also relevant to embedding social media posts in a publication:
8. Ensure terms of service cover preservation of data in third-party services
14. Avoid depending on third party services for core intellectual components
55. Consider ethical implications of embedding social media posts

Some publications, especially in a web environment, may include social media posts or user-contributed content that are automatically included in the archive package, especially with a web harvesting approach. Before implementing these features or including them in publications, consider whether taking copies of them infringes on individual rights or safety. Preservation services may not be able to evaluate specific situations in a scalable way, so it’s important to keep such content out of the preservation scope if there is uncertainty around it. This may involve designing the website so that certain content can easily be excluded, e.g. by keeping it at a separate URL that can be skipped during crawling. The Documenting the Now project website includes information about ethical collection of social media content.

These guidelines discuss legal and technical considerations for preservation:
8. Ensure terms of service cover preservation of data in third-party services
54. Avoid embedding social media posts in a publication

Dynamic maps, such as those generated with Google Maps, consist of many smaller map tiles that are loaded on the fly as users pan and zoom. Web crawlers cannot easily capture this experience, nor can it be exported. If the map is not the focal point of the work and is being used to present a small number of locations, consider using one or more still images. Display the place name and coordinates for each pin in the caption and provide a link to a live map.

These guidelines offer alternative ways to manage dynamic map features:
16. Captions add important context to non-text features
53. Consider web page designs that pre-load all data when the page loads

Some web-based features require communication with a server that is driven by unpredictable user interaction or that uses an open-ended number of URLs to retrieve the data supporting the feature. These features cannot be exported easily because of their dependence on a live website, and they cannot be captured well using web archiving, which depends on identifying every unique URL. Examples include dynamic maps (e.g. Google Maps), full-text or faceted search, web forms, data visualizations (e.g. ArcGIS), IIIF image viewers, and streamed content. Some features can be redesigned to remove their dependency on a live server, but if they can’t, publishers will need to consider what can be preserved. There are many strategies for this, for example: create a simpler, static version of the feature that captures its key properties for the purpose of preservation; embed a local copy of a server-based resource rather than depend on a third-party service; supply code or data for the feature with documentation for re-assembling the functionality; record a video of the interaction as it behaves in the published environment for future playback; or use a combination of these.

These guidelines offer alternative ways to manage features that depend on a live server:
16. Captions add important context to non-text features
53. Consider web page designs that pre-load all data when the page loads
63. Supply raw data, documentation for data visualizations

If publishers are involved early enough in the development process for a custom web application that is being built for a single publication, they should encourage developers and authors to make choices that avoid external dependencies, or to provide fallback mechanisms for when external dependencies fail. For example, if a connection to Google Maps fails, fall back to a still image and the coordinates. Developers can test their site by running it in a virtual environment with no internet connection. If it works, it is not only likely to be easier to preserve, but also much more sustainable and easier for the publisher to maintain.

These guidelines may be referred to when considering encapsulation:
14. Avoid depending on externally hosted web services
51. Embed multimedia locally
56. Avoid embedding map visualizations where a static representation would suffice

All websites have to be maintained in order to be sustained on the live web. An over-complicated web application will not only degrade more quickly and be more expensive to maintain, but will likely also be more difficult to preserve as an application. Unless the focus of the project is experimental technology, use technologies and programming languages that can be readily supported by technical staff. Do not unnecessarily overcomplicate the infrastructure and code. A helpful reference for building sustainable projects is the University of Victoria’s Endings Principles for Digital Longevity.

These guidelines may also be helpful when considering publication software:
2. General considerations for designing or selecting publication platforms
3. Favor existing standards

For custom websites or software, publishers should request an installation script from the authors or developers. This can be used in combination with a clean installation package (one that is unpolluted by extraneous files and data generated in the live environment during deployment and use) to install the software or website in a new environment. In addition to the install script, the authors or developers should provide a document listing the machine requirements and any dependencies that will be installed or used by the script. If a script is not available, at minimum the authors or developers should provide documentation that describes the requirements, dependencies, and detailed installation process, with sample commands as appropriate. This information can be kept in a README file in the root of the project. While installation scripts may stop working as technology evolves, they provide information about how to get the software working and can be vital context for a preservation service or when migrating to new infrastructure.

These guidelines also discuss the installation package for a web application:
61. Create installation packages for custom websites that don’t require a live server
62. Create installation packages for custom websites that do require a live server
67. Keep the source code and compiled version of the software

When a custom publication is developed using plain HTML5, CSS, and JavaScript that does not communicate with a live web server, it may be possible to run the entire application from a local machine by opening it in a browser. In this case, a clean application package should be created and retained by the publisher as a backup and for preservation. Work with the developer and author to ensure that this preservation copy: functions fully offline; does not contain any system files, server information, or logs; uses relative links that do not contain a specific domain name; and contains only local stylesheet, font, or JavaScript references. If there are features that depend on a third-party service, e.g. for search or commenting, that are not a core intellectual component of the work, these can be disabled. A README file should be placed in the root of the application folder to describe the project and to provide instructions, dependencies, versions of the technologies used, and details of any unique features that might be useful for playback later. The entire package can be stored as a zip file. If updates happen once the application is deployed on the live server, these should be reflected in the clean preservation copy and a version number should be expressed in the package.

When other methods of preserving a web publication (export, web crawling) cannot appropriately capture the important properties of a publication because it is dynamic and data-driven, a preservation institution may attempt to preserve the application itself with the goal of running it in an emulated web server environment in the future. In order to do this, the preservation institution would require a clean installation package as well as documentation of the requirements, dependencies, and installation process. A preservation copy could be created during the publication process. Work with the developer and author to ensure this preservation copy: functions fully in a self-contained web server that does not have access to any resources outside of the machine; does not contain any server information or logs; uses relative links that do not contain a specific domain name; and contains only local stylesheet, font, or JavaScript references. Where features require a live third-party site, consider a local functionality that could replace it adequately in this package. Overall, it would be beneficial for the developers of the publication to design any website with sustainability and encapsulation in mind, ensuring files are local to the application where possible and that there is a simple way to fall back to local functionality for integrations such as third-party resources.

These guidelines also discuss the installation package for a web application:
58. Consider encapsulation of custom-built web applications early
60. Request an installation script for custom software and websites
61. Produce packages for software and websites that don’t require a live server

Data visualizations tend to be a particular arrangement of one or more raw datasets. Data visualization formats can obscure parts of the underlying data that they are derived from. They may also be compiled or complex. All of these properties could potentially make the data difficult to open, validate, or comprehend in the future. To preserve a publication in which data visualizations are core intellectual components, request underlying raw data from the author. Request supporting documentation that would enable a future reader to retrace the author's steps from the raw data to the visualization. Images or videos of the visualization may also be helpful for recreating it. For both visualization and raw data formats, as with all supplements, ideally the files will be an open or broadly adopted format. The Library of Congress Recommended Formats Statement can help with selecting formats. In the case of vector data, for example, there is not a broadly adopted open format, but Shapefile, while proprietary, is broadly adopted and openly documented. There are a variety of tools that can read Shapefiles which increases the likelihood that it will continue to be supported in some form.

These guidelines may also be relevant when considering preservation of data visualizations:
11. Use non-proprietary, broadly supported and adopted open file formats
57. Use alternative approaches for features that require communication with a server
64. Use meaningful file names and field names in your data, supply documentation